Multivariate Stochastic Dominance via Optimal Transport and Applications to Models Benchmarking
Rioux, Gabriel, Nitsure, Apoorva, Rigotti, Mattia, Greenewald, Kristjan, Mroueh, Youssef
Stochastic dominance is an important concept in probability theory, econometrics, and social choice theory for robustly modeling agents' preferences between random outcomes. While many works have been dedicated to the univariate case, little has been done in the multivariate scenario, wherein an agent has to decide between different multivariate outcomes. By exploiting a characterization of multivariate first-order stochastic dominance in terms of couplings, we introduce a statistic that assesses multivariate almost stochastic dominance under the framework of Optimal Transport with a smooth cost. Further, we introduce an entropic regularization of this statistic, and establish a central limit theorem (CLT) and consistency of the bootstrap procedure for the empirical statistic. Armed with this CLT, we propose a hypothesis testing framework as well as an efficient implementation using the Sinkhorn algorithm. We showcase our method in comparing and benchmarking Large Language Models that are evaluated on multiple metrics. Our multivariate stochastic dominance test allows us to capture the dependencies between the metrics in order to make an informed and statistically significant decision on the relative performance of the models.
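The entropic OT statistic described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's exact construction: the hinge-style "smooth dominance cost" (near zero when one sample coordinate-wise dominates the other), the regularization strength `eps`, and the synthetic metric vectors are all assumptions made for the example.

```python
import numpy as np

def sinkhorn(C, eps=1.0, n_iter=200):
    """Entropic OT between uniform empirical measures via Sinkhorn iterations.

    C: (n, m) cost matrix. Returns the transport plan P and the
    transport cost <P, C>, here serving as the dominance statistic.
    """
    n, m = C.shape
    a = np.full(n, 1.0 / n)          # uniform weights on the first sample
    b = np.full(m, 1.0 / m)          # uniform weights on the second sample
    K = np.exp(-C / eps)             # Gibbs kernel
    u = np.ones(n)
    for _ in range(n_iter):
        v = b / (K.T @ u)            # alternate scaling updates
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]  # coupling between the two samples
    return P, float((P * C).sum())

# Hypothetical setup: rows are per-prompt metric vectors for two models.
rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(50, 3))   # model A, 3 metrics
Y = rng.normal(0.3, 1.0, size=(60, 3))   # model B, shifted upward

# Smooth dominance cost: zero when y >= x in every metric, positive otherwise.
C = np.maximum(X[:, None, :] - Y[None, :, :], 0.0).sum(axis=-1)
P, stat = sinkhorn(C, eps=1.0)
# A small `stat` is evidence that Y (almost) stochastically dominates X;
# the paper's test then compares it against a CLT/bootstrap threshold.
```

The coupling `P` is exactly the object the abstract's characterization refers to: first-order dominance of Y over X holds when some coupling has zero cost, and the entropic statistic measures how far the best smooth coupling is from that ideal.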
Learning the irreversible progression trajectory of Alzheimer's disease
Wang, Yipei, He, Bing, Risacher, Shannon, Saykin, Andrew, Yan, Jingwen, Wang, Xiaoqian
Alzheimer's disease (AD) is a progressive and irreversible brain disorder that unfolds over the course of 30 years. Therefore, it is critical to capture the disease progression at an early stage such that intervention can be applied before the onset of symptoms. Machine learning (ML) models have been shown to be effective in predicting the onset of AD. Yet for subjects with follow-up visits, existing techniques for AD classification only aim for accurate group assignment, and the monotonically increasing risk across follow-up visits is usually ignored. The resulting fluctuating risk scores across visits violate the irreversibility of AD, undermining the trustworthiness of models and providing little insight into the disease progression. To address this issue, we propose a novel regularization approach to predict AD longitudinally. Our technique aims to maintain the expected monotonicity of increasing disease risk during progression while preserving expressiveness. Specifically, we introduce a monotonicity constraint that encourages the model to predict disease risk in a consistent and ordered manner across follow-up visits. We evaluate our method using the longitudinal structural MRI and amyloid-PET imaging data from the Alzheimer's Disease Neuroimaging Initiative (ADNI). Our model outperforms existing techniques in capturing the progressiveness of disease risk, and at the same time preserves prediction accuracy.
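The monotonicity constraint the abstract describes can be sketched as a soft penalty on decreases in predicted risk across ordered visits. This is an illustrative assumption about the form of the regularizer (a squared hinge on negative visit-to-visit differences), not the paper's exact loss; the function name and data are hypothetical.

```python
import numpy as np

def monotonicity_penalty(risks):
    """Penalize decreases in predicted risk across ordered follow-up visits.

    risks: (n_subjects, n_visits) array of predicted risk scores,
    with visits in chronological order. Returns the mean squared hinge
    on negative differences, which is zero iff every subject's
    trajectory is nondecreasing.
    """
    diffs = np.diff(risks, axis=1)           # risk[t+1] - risk[t]
    violations = np.maximum(-diffs, 0.0)     # positive where risk decreased
    return float(np.mean(violations ** 2))

# A monotone trajectory incurs no penalty; a fluctuating one does.
monotone = np.array([[0.1, 0.2, 0.4, 0.7]])
fluctuating = np.array([[0.1, 0.4, 0.2, 0.7]])
print(monotonicity_penalty(monotone))      # 0.0
print(monotonicity_penalty(fluctuating))   # > 0
```

In training, a term like this would be added to the classification loss with a weight balancing ordered trajectories against predictive accuracy, matching the abstract's goal of monotonicity without sacrificing expressiveness.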
Risk Assessment and Statistical Significance in the Age of Foundation Models
Nitsure, Apoorva, Mroueh, Youssef, Rigotti, Mattia, Greenewald, Kristjan, Belgodere, Brian, Yurochkin, Mikhail, Navratil, Jiri, Melnyk, Igor, Ross, Jerret
Foundation models such as large language models (LLMs) have shown remarkable capabilities, redefining the field of artificial intelligence. At the same time, they present pressing and challenging socio-technical risks regarding the trustworthiness of their outputs and their alignment with human values and ethics [Bommasani et al., 2021]. Evaluating LLMs is therefore a multi-dimensional problem, where those risks are assessed across diverse tasks and domains [Chang et al., 2023]. In order to quantify these risks, Liang et al. [2022], Wang et al. [2023], and Huang et al. [2023] proposed benchmarks of automatic metrics for probing the trustworthiness of LLMs. These metrics include accuracy, robustness, fairness, toxicity of the outputs, etc. Human evaluation benchmarks can be even more nuanced, and are often employed when tasks surpass the scope of standard metrics. Notable benchmarks based on human and automatic evaluations include, among others, Chatbot Arena [Zheng et al., 2023], HELM [Bommasani et al., 2023], MosaicML's Eval, Open LLM Leaderboard [Wolf, 2023], and BIG-bench [Srivastava et al., 2022], each catering to specific evaluation areas such as chatbot performance, knowledge assessment, and domain-specific challenges. Traditional metrics, however, sometimes do not correlate well with human judgments.